In this project, the variation of trips, cases, and deaths; the distribution of the case mortality rate; and the relationship of different trip distances with cases and deaths were investigated. One of the two datasets used in this analysis comes from the World Health Organization (WHO), which continuously compiles global COVID-19 data to show statistics and encourage data exploration. The other dataset originates from the Bureau of Transportation Statistics, which measures the population staying at home and the trips by distance for the population not staying home during the COVID-19 pandemic and stay-at-home order. The analysis was carried out with statistical tools such as measures of location and graphs that show relationships and trends. For the descriptive analysis, the median case mortality rate over a year-long range and the linear relationship between cases and deaths were observed. For our inferential analysis, we used a multiple linear regression and ANOVA model to assess the effect of a selection of trip distances on COVID-19 cases and deaths. For a sensitivity analysis, diagnostic plots were used to assess our results.
The dataset from the World Health Organization (WHO) consists of daily reported counts of cases and deaths due to COVID-19. We also referenced the Trips by Distance dataset from the Bureau of Transportation Statistics for information on the number of trips taken nationwide, by distance, during the pandemic. In this data analysis, we are primarily interested in the following questions: what is the relationship of trips with COVID-19 cases and deaths, are there variations between different distances for trips taken outside of the home and COVID-19 cases and deaths, and how correlated are cases and deaths with different distances. The data analysis will assess whether trip distances affect COVID-19 cases. It will also indicate whether shorter and/or longer trips contribute to a rise in cases and deaths, primarily due to community transmission.
This data analysis utilizes the two following datasets: the WHO COVID-19 Dashboard and the Trips by Distance dataset.
The WHO has consistently collected data from around the world during the COVID-19 pandemic. The WHO dataset reports daily confirmed counts by countries, territories, and areas. Counts starting from December 31, 2019 to March 21, 2020 were taken from International Health Regulations (IHR) and thereafter, the confirmed counts are continuously compiled from global data on WHO dashboards of the headquarters and different regions. Daily counts may vary between countries, territories, and areas due to different sources, cut-off times, case detections, etc. There may also be instances of negative numbers because of the removal of cases or deaths, which are attributed to changes in data reconciliation procedures.
The WHO COVID-19 dataset used in the analysis contains daily updated counts of cases and deaths with the following variables: the reporting date, country code, country, WHO region, new cases count, cumulative cases to date, new deaths count, and cumulative deaths to date.
The Trips by Distance dataset compiles statistics of transportation, including the distance people travel when they are not quarantining during the COVID-19 pandemic and stay-at-home order. The Bureau of Transportation Statistics has reported data to date of mobility statistics. The data is gathered and merged from a mobile device data panel that takes distances traveled by people. The Bureau considers temporal frequency and spatial accuracy for locations as well as temporal coverage and representativeness for devices. All personal mobile data and locations have been excluded from sources for confidentiality. These trips include all types of transportation, including but not limited to airlines, transit, trains, and cars.
The dataset contains daily updated statistics with variables, such as the national count of the population staying at home, the population not staying at home, and the count of trips of various distances from less than one mile to over 500 miles, which are the primary variables of this data analysis.
To observe the relationship of COVID-19 cases and deaths with trips by distance, we merged the two datasets by date. The following is what we will be using to investigate our questions of interest. We will primarily focus on one year of data, from February 1, 2020 to January 31, 2021, for our analysis.
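The merge step can be sketched as follows. This is a minimal illustration on small invented data frames, not the real files; the column names mirror those in the two datasets, and the object names `trips`, `covid`, and `merged_subset` are hypothetical.

```r
# Toy stand-ins for the two datasets (values are invented for illustration).
trips <- data.frame(
  Date = as.Date(c("2020-02-01", "2020-02-02", "2020-02-03")),
  Number_of_Trips = c(1200, 1350, 1100)
)
covid <- data.frame(
  Date_reported = as.Date(c("2020-02-02", "2020-02-03", "2020-02-04")),
  New_cases = c(3, 5, 8),
  New_deaths = c(0, 1, 0)
)

# Inner join on the date: only days present in both tables are kept.
merged <- merge(trips, covid, by.x = "Date", by.y = "Date_reported")

# Restrict to the analysis window (February 1, 2020 to January 31, 2021).
in_window <- merged$Date >= as.Date("2020-02-01") &
  merged$Date <= as.Date("2021-01-31")
merged_subset <- merged[in_window, ]
print(merged_subset)
```

Selecting the window by date rather than by row position is one possible design choice; it keeps the subset stable if the source files gain or lose rows.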
To visualize the COVID-19 data set, below is an interactive plot to show the changes in new cases and deaths across time, starting from when cases and deaths started to be reported. The slow start of confirmed cases is attributed to the delay of discovery of the coronavirus, delay of political action, and lack of resources in order to confirm cases of COVID-19.
Another plot is included below to show the changes in new cases over time. This can reflect trends of the population with seasons as well as policy changes. It is observed that there was a substantial increase during the summer months. This shows there was an increase in community transmission due to the population leaving their homes more often as well as the changes in policies to improve the economic environment with relaxing certain restrictions for businesses of the stay-at-home order. It is also observed that there was a very large increase during the holiday season, ranging from November to January. Families are more likely to have larger group sizes and more likely to travel to visit family and friends.
The plot below is of new deaths versus new cases. Here, we can observe that when new cases start to rise, the number of new deaths also starts to rise. The new cases and new deaths have a linear, positive relationship.
Next, we compare the trends of the number of trips outside of the home per day, the number of cases, and the number of deaths. In the graph below, we can observe the delay between people taking trips, people showing symptoms and confirming they have COVID-19, and a share of those cases ending in death. This delay is consistent with COVID-19’s incubation period of up to 14 days. For example, around mid to late May, there was an increase in the number of trips, which was followed by an increase in confirmed cases starting early to mid June and then by a general increase in deaths in late July to early August. In order to visualize the trends on a common axis, the variables were rescaled by powers of 10: new cases were divided by 10, new deaths were multiplied by 10, and the number of trips was divided by 10,000.
Next, below is a graph that displays the case mortality rate over time. This was calculated by dividing the daily new deaths by the daily new cases and expressing the result as a percentage. We can immediately observe that the graph resembles the behavior of both the cases and the deaths graphs. On one hand, the first half of the case mortality rate graph mirrors the first half of the deaths graph and most likely contains outliers due to inconsistent counting of cases at the beginning of the pandemic. On the other hand, the second half of the case mortality rate graph mirrors the second half of the cases graph and most likely does not contain outliers because the nation’s ability to detect COVID-19 cases and deaths had improved.
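One way to put a number on such a delay is to correlate trips on day t with cases on day t + k for a range of lags k. The sketch below uses simulated data in which the delay is set to 14 days by construction; it is not the analysis performed in this report, and real series are autocorrelated, so the peak would be broader and harder to read.

```r
set.seed(1)
n <- 200
true_lag <- 14  # assumed delay in days, chosen only to build the toy data

# Simulated daily trips (white noise around a baseline) and simulated cases
# that echo the trips series true_lag days later, plus a little noise.
trips <- 1000 + rnorm(n, sd = 50)
cases <- c(rep(NA, true_lag), trips[1:(n - true_lag)]) / 10 + rnorm(n, sd = 0.5)

# Correlation between trips on day t and cases on day t + k, for k = 0..30.
lags <- 0:30
lag_cor <- sapply(lags, function(k) {
  cor(trips[1:(n - k)], cases[(1 + k):n], use = "complete.obs")
})

# The lag with the strongest correlation recovers the simulated delay.
best_lag <- lags[which.max(lag_cor)]
print(best_lag)
```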
By using the summary function, we can observe the distribution of case mortality rates throughout the year.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.756 1.099 1.564 2.493 2.679 22.322 29
Because the daily death count is more closely related to cases confirmed on earlier days, the deaths reported on a given day are not necessarily deaths among the cases confirmed that day. For this reason, we report the median case mortality rate, which minimizes the influence of outliers such as days with an unusually high number of deaths and a low number of confirmed cases. The median case mortality rate suggests that roughly 1.564% of cases end in death.
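The effect of outliers on the mean versus the median can be seen in a small sketch. The counts below are invented for illustration and include a zero-case day and a negative correction, both of which can occur in the WHO data.

```r
# Toy daily counts (invented values).
new_cases  <- c(100, 200, 0, 400, 50, 1000)
new_deaths <- c(2, 3, 1, 6, 10, -5)

# Daily case mortality rate in percent.
cmr <- new_deaths / new_cases * 100

# Division by zero yields Inf (or NaN for 0/0); treat both as missing.
cmr[!is.finite(cmr)] <- NA

# One extreme day pulls the mean far above the median.
print(summary(cmr))
cmr_median <- median(cmr, na.rm = TRUE)
cmr_mean <- mean(cmr, na.rm = TRUE)
print(c(median = cmr_median, mean = cmr_mean))
```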
We have chosen to follow a one-way ANOVA model for this data set.
\(Y_{ij} = \mu_i + \epsilon_{ij}\)
\(Y_{ij}\) represents the outcome of the \(j^{th}\) value of Y for the \(i^{th}\) group.
\(\mu_i\) represents the mean number of trips outside of the home per day for the \(i^{th}\) group.
\(i = 1, 2, 3\) indexes the three levels, one for each interval of trip distance.
\(j\) indexes the daily observations of trips taken within a specific interval of trip distance.
We are interested in whether or not there is statistical evidence that the means among the different trip distances are equal. If we fail to reject the null hypothesis, we will conclude that there is not enough evidence that the number of trips out of the home during the year had a significant effect on the confirmed new cases of COVID-19 or the confirmed deaths resulting from COVID-19.
\(H_0 : \mu_1 = \mu_2 = \mu_3\)
\(H_A :\) at least one \(\mu_i\) differs from the others
The three independent variables that we will use from this data set are Number_of_Trips_1-3, Number_of_Trips_50-100, and Number_of_Trips_250-500. The two dependent variables are New_cases and New_deaths. The first three variables report the number of trips taken throughout the United States from February 1, 2020 to January 31, 2021, during the COVID-19 pandemic. The trips are grouped into three distance intervals: 1-3 miles, 50-100 miles, and 250-500 miles. The New_cases and New_deaths variables are the numbers of new confirmed cases and confirmed deaths during the same time frame. Each independent variable is plotted below against the dependent variables.
The three main assumptions for the one-way ANOVA model are the following:
The \(Y_{ij}\)’s are required to be randomly sampled
The \(i\) groups are independent and randomly selected from the true population
\(\epsilon_{ij} \sim N(0, \sigma^{2}_{\epsilon})\), where the errors are independent and should follow a normal distribution.
Model No. 1:
## Analysis of Variance Table
##
## Response: New_cases
## Df Sum Sq Mean Sq F value Pr(>F)
## `Number_of_Trips_250-500` 1 7.9396e+10 7.9396e+10 17.844 3.038e-05 ***
## `Number_of_Trips_1-3` 1 1.3910e+11 1.3910e+11 31.262 4.447e-08 ***
## `Number_of_Trips_50-100` 1 1.0275e+11 1.0275e+11 23.092 2.267e-06 ***
## Residuals 362 1.6107e+12 4.4495e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F test-statistics for all three variables (17.844, 31.262, and 23.092) are much larger than 1, roughly the value expected under the null hypothesis, meaning there is statistical evidence to suggest the group means are not equal for each quantity of trips taken.
Since the p-value for each of the variables is less than our significance level of 0.05, we will reject the null hypothesis and conclude that there is statistical evidence to suggest that the true average quantity of trips taken is not equal.
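As a check on how the table is read, the F values and p-values for Model No. 1 can be recomputed from its mean squares. The sketch below copies the numbers from the output above; the variable names are only illustrative.

```r
# Values copied from the "Response: New_cases" ANOVA table.
mean_sq <- c(
  "Number_of_Trips_250-500" = 7.9396e10,
  "Number_of_Trips_1-3"     = 1.3910e11,
  "Number_of_Trips_50-100"  = 1.0275e11
)
resid_mean_sq <- 4.4495e09
resid_df <- 362

# Each term has 1 degree of freedom, so F = term mean square / residual mean square.
f_value <- mean_sq / resid_mean_sq

# Upper-tail probability of the F(1, 362) distribution.
p_value <- pf(f_value, df1 = 1, df2 = resid_df, lower.tail = FALSE)

print(round(f_value, 3))
print(signif(p_value, 4))
```

The recomputed values agree with the table up to the rounding of the printed mean squares.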
Model No. 2:
## Analysis of Variance Table
##
## Response: New_deaths
## Df Sum Sq Mean Sq F value Pr(>F)
## `Number_of_Trips_250-500` 1 34260805 34260805 35.8712 5.086e-09 ***
## `Number_of_Trips_1-3` 1 18051827 18051827 18.9003 1.792e-05 ***
## `Number_of_Trips_50-100` 1 1826963 1826963 1.9128 0.1675
## Residuals 362 345748741 955107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the trip distances of 1-3 miles and 250-500 miles, the F test-statistics (18.9003 and 35.8712) are much larger than 1, meaning there is statistical evidence to suggest the group means are not equal. The F test-statistic for the trip distance of 50-100 miles (1.9128) is close to 1, roughly the value expected under the null hypothesis, meaning there is little evidence that this group mean differs from the other two.
The p-value for the trip distance of 50-100 miles (0.1675) is larger than our significance level of 0.05; therefore, we fail to reject the null hypothesis for this variable. This suggests that, after accounting for the other two distances, the number of 50 to 100 mile trips was not significantly associated with the number of deaths from COVID-19 in the last year.
For the trip distances of 1-3 miles and 250-500 miles, the evidence from the F test-statistics is confirmed by both p-values being less than our significance level of 0.05. This suggests that the numbers of 1 to 3 mile trips and 250 to 500 mile trips in the last year were significantly associated with the daily count of confirmed deaths.
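One caveat when reading these tables: `anova()` on a model with several terms reports sequential sums of squares, so each row tests a term after the terms listed before it, and the F values can change with the order of the terms. The sketch below illustrates this with R's built-in `mtcars` data rather than the project data; `mpg`, `wt`, and `hp` are stand-ins chosen only for the example.

```r
# Built-in example data; the same two predictors are entered in both orders.
fit_a <- lm(mpg ~ wt + hp, data = mtcars)
fit_b <- lm(mpg ~ hp + wt, data = mtcars)

tab_a <- anova(fit_a)
tab_b <- anova(fit_b)

# Same predictor (hp), different position in the formula.
f_hp_last  <- tab_a["hp", "F value"]
f_hp_first <- tab_b["hp", "F value"]
print(c(hp_last = f_hp_last, hp_first = f_hp_first))

# For the term entered last, the sequential F equals the squared t statistic
# from summary(), so the ANOVA table and the coefficient table agree for it.
t_hp <- summary(fit_a)$coefficients["hp", "t value"]
print(all.equal(f_hp_last, t_hp^2))
```

In the models here, `Number_of_Trips_50-100` is entered last, which is why its F values in the ANOVA tables (23.092 and 1.9128) match its squared t values in the regression summaries (4.805 and -1.383) up to rounding.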
To continue to test how well our models fit, we calculate the adjusted R-squared value for both models below. The adjusted R-squared value tells us how much of the variation in the output variable is explained by the input variables, adjusted for the number of inputs. The higher the adjusted R-squared value, the better the model fits. We found that the ANOVA output is exactly the same for the linear version of our model (shown below), so we calculated the adjusted R-squared using the summary function. The adjusted R-squared is quite low for both models: 0.1594 with New_cases as the dependent variable and 0.1282 with New_deaths. Given the low R-squared values for both models, we continue with our tests for goodness of fit.
Model No. 1 - ANOVA for linear model and adjusted R-squared analysis:
## Analysis of Variance Table
##
## Response: New_cases
## Df Sum Sq Mean Sq F value Pr(>F)
## `Number_of_Trips_250-500` 1 7.9396e+10 7.9396e+10 17.844 3.038e-05 ***
## `Number_of_Trips_1-3` 1 1.3910e+11 1.3910e+11 31.262 4.447e-08 ***
## `Number_of_Trips_50-100` 1 1.0275e+11 1.0275e+11 23.092 2.267e-06 ***
## Residuals 362 1.6107e+12 4.4495e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = New_cases ~ `Number_of_Trips_250-500` + `Number_of_Trips_1-3` +
## `Number_of_Trips_50-100`, data = newdata_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88756 -39572 -25414 13708 290294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.301e+05 3.443e+04 6.683 8.88e-11 ***
## `Number_of_Trips_250-500` -4.779e-02 8.586e-03 -5.566 5.07e-08 ***
## `Number_of_Trips_1-3` -7.890e-04 1.237e-04 -6.376 5.53e-10 ***
## `Number_of_Trips_50-100` 6.380e-03 1.328e-03 4.805 2.27e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66700 on 362 degrees of freedom
## Multiple R-squared: 0.1663, Adjusted R-squared: 0.1594
## F-statistic: 24.07 on 3 and 362 DF, p-value: 3.187e-14
Model No. 2 - ANOVA for linear model and adjusted R-squared analysis:
## Analysis of Variance Table
##
## Response: New_deaths
## Df Sum Sq Mean Sq F value Pr(>F)
## `Number_of_Trips_250-500` 1 34260805 34260805 35.8712 5.086e-09 ***
## `Number_of_Trips_1-3` 1 18051827 18051827 18.9003 1.792e-05 ***
## `Number_of_Trips_50-100` 1 1826963 1826963 1.9128 0.1675
## Residuals 362 345748741 955107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = New_deaths ~ `Number_of_Trips_250-500` + `Number_of_Trips_1-3` +
## `Number_of_Trips_50-100`, data = newdata_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2056.1 -630.2 -233.0 341.7 5019.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.488e+03 5.044e+02 8.897 < 2e-16 ***
## `Number_of_Trips_250-500` -5.411e-04 1.258e-04 -4.302 2.18e-05 ***
## `Number_of_Trips_1-3` -7.289e-06 1.813e-06 -4.021 7.06e-05 ***
## `Number_of_Trips_50-100` -2.690e-05 1.945e-05 -1.383 0.168
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 977.3 on 362 degrees of freedom
## Multiple R-squared: 0.1354, Adjusted R-squared: 0.1282
## F-statistic: 18.89 on 3 and 362 DF, p-value: 2.09e-11
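The adjusted R-squared values reported above follow from the multiple R-squared, the number of observations, and the number of predictors. A short sketch of that arithmetic, using numbers read from the two outputs (the variable names are illustrative):

```r
# Values read from the two summary() outputs.
n <- 366   # 362 residual df + 3 predictors + 1 intercept
p <- 3     # number of predictors
r2 <- c(New_cases = 0.1663, New_deaths = 0.1354)

# Adjusted R-squared penalizes R-squared for the number of predictors:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

With 366 observations and only 3 predictors the penalty is small, so the adjusted values sit just below the multiple R-squared values.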
In this section we start the sensitivity analysis of our model to test how well our model fits our data. Below, we have several plots for our first model based on the number of new cases observed. These plots are a Residuals vs. Fitted plot, a Normal Q-Q plot, a Scale-Location plot, and finally a Residuals vs. Leverage plot.
The Residuals vs. Fitted plot shows whether the residuals have a non-linear pattern. The Normal Q-Q plot shows whether the residuals are normally distributed. The Scale-Location plot shows whether the residuals are spread evenly across the range of fitted values. Finally, the Residuals vs. Leverage plot gives us insight into which data points are influential outliers. This is shown by Cook’s distance. If a data point is outside the dashed line of Cook’s distance, then it is an influential point that may be an outlier.
Below are the plots for our first model based on the number of cases:
Starting with the Residuals vs. Fitted plot, we can see that our residuals follow a roughly linear pattern. The trend does, however, slope slightly downwards, indicating that the relationship is not entirely linear and that our model doesn’t have a perfect fit.
Next, by looking at the Normal Q-Q plot, we can see that the residuals deviate from the Normal line. This indicates that our residuals may be skewed, and that there may be some underlying issues with the data or our model.
Moving on to the Scale-Location plot, we can see that our residuals are clustering and are not evenly spread. This indicates that the residuals do not have equal variance, and that there is some skew to our data.
Finally, by taking a look at our Residuals vs. Leverage plot, we can see that there are no data points outside Cook’s distance. This indicates that there are no highly influential outliers in our data set, which is consistent with the clustering of data noted earlier.
Now let’s take a look at the following plots for our other model based on the number of Covid-19 related deaths:
Looking at the Residuals vs. Fitted plot, we can see that our residuals follow a more linear pattern than in our last model. This shows us that this model is a better fit than the last one, and may indicate that our chosen predictor variables were more appropriate.
Next, by looking at the Normal Q-Q plot, we can see that the residuals don’t deviate from the Normal line until the edges of the graph. This tells us that our residuals may be slightly skewed, but the fit is much cleaner than in the last model. All in all, the residuals follow the Normal line, showing that the data is less skewed.
By observing our Scale-Location plot, we can see that our residuals are clustering just like the last model, and that they are also not evenly spread. This gives us insight into the lack of equal variance in our model, and that there is some skew to our data, just like the last model.
Finally, by taking a look at our Residuals vs. Leverage plot, we can see that there are no data points outside Cook’s distance in this model either. This indicates that there are no highly influential outliers in our data set, which is consistent with our other model as well.
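The visual checks above can be complemented with numeric summaries. The sketch below uses R's built-in `mtcars` data only to stay self-contained; it is not output from the models in this report, and the 4 / n cutoff for Cook’s distance is a common rule of thumb rather than a formal test.

```r
# Built-in example data, used only to keep the sketch self-contained.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The four standard diagnostic plots in a single 2 x 2 panel.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Cook's distance for every observation; values above 4 / n are flagged
# for a closer look (a convention, not a strict test).
cd <- cooks.distance(fit)
flagged <- names(cd)[cd > 4 / length(cd)]
print(flagged)

# Shapiro-Wilk test of residual normality (small p-values suggest non-normality).
sw <- shapiro.test(residuals(fit))
print(sw)
```

Applied to `cases.model2` or `death.model2`, the same calls would give numbers to accompany the plots.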
library(dplyr)
library(tibble)
library(tidyverse)
library(ggplot2)
library(car)
library(stringr)
library(readr)
library(plotly)
knitr::opts_chunk$set(fig.pos = 'H')
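# Load the Trips by Distance data and replace spaces in column names with underscores
tripsbydistance <- read_csv("Trips_by_Distance.csv")
names(tripsbydistance) <- str_replace_all(names(tripsbydistance), " ", "_")
# Load the WHO global data and keep only the United States
WHO_COVID_19_global_data <- read_csv("https://covid19.who.int/WHO-COVID-19-global-data.csv")
us_covid_data <- filter(WHO_COVID_19_global_data, Country == "United States of America")
# Merge the two datasets by date
mergedata <- merge(tripsbydistance, us_covid_data, by.x = "Date", by.y = "Date_reported")
# Drop the columns that are not used in the analysis
newdata <- mergedata[-c(2:9, 20:21)]
# Drop the rows outside the one-year analysis window (366 daily rows remain)
newdata_subset <- newdata[-c(1:29, 396:415), ]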
newdata %>%
  filter(Date >= "2020-03-01", Date <= "2021-02-20") %>%
  group_by(Date, WHO_region) %>%
  summarize(Deaths = sum(New_deaths),
            Cases = sum(New_cases)) %>%
  mutate(Date = Date - as.Date("2020-03-01")) %>%
  plot_ly(
    x = ~Cases,
    y = ~Deaths,
    frame = ~Date,
    type = 'scatter',
    mode = 'markers',
    showlegend = TRUE
  )
fig2 <- plot_ly(newdata_subset, x = ~Date, y = ~New_cases, name = 'New cases',
                type = 'scatter', mode = 'lines') %>%
  layout(title = "New Cases vs Date",
         xaxis = list(title = "Date"),
         yaxis = list(title = "New Cases"))
fig2
fig3 <- newdata_subset %>%
  filter(Date >= "2020-03-01", Date <= "2021-02-20") %>%
  ggplot(aes(x = New_cases, y = New_deaths)) +
  geom_point() +
  geom_text(aes(label = " "), hjust = 0, vjust = 0) +
  ggtitle("New Deaths vs New Cases") +
  geom_smooth(method = 'lm', formula = y ~ x) +
  xlab("New Cases") + ylab("New Deaths")
fig3
# Rescale each series (cases / 10, deaths * 10, trips / 10000) so all three fit on one axis
fig4 <- plot_ly(newdata, x = ~Date, y = ~New_cases/10, name = 'New cases',
                type = 'scatter', mode = 'lines',
                line = list(color = 'rgb(22, 96, 167)')) %>%
  layout(title = "Counts vs Date",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Count"))
fig4 <- fig4 %>% add_trace(y = ~New_deaths*10, name = 'New Deaths', mode = 'lines', line = list(color = 'rgb(205, 12, 24)'))
fig4 <- fig4 %>% add_trace(y = ~((mergedata$Number_of_Trips)*(1/10000)), name = 'Number of Trips', mode = 'lines', line = list(color = I('green')))
fig4
# Daily case mortality rate in percent; infinite values from zero-case days become NA
casemortalityrate = (newdata_subset$New_deaths / newdata_subset$New_cases)*100
casemortalityrate[is.infinite(casemortalityrate)] <- NA
fig5 <- plot_ly(newdata_subset, x = ~Date, y = ~casemortalityrate, name = 'trace 0',
                type = 'scatter', mode = 'lines') %>%
  layout(title = "Case Mortality Rate",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Case Mortality Rate"))
fig5
summary(casemortalityrate)
newdata_subset = newdata[-c(1:29, 396:415),]
cases.model = aov(New_cases ~ `Number_of_Trips_250-500` + `Number_of_Trips_1-3` + `Number_of_Trips_50-100`, data=newdata_subset)
anova(cases.model)
deaths.model = aov(New_deaths ~ `Number_of_Trips_250-500` + `Number_of_Trips_1-3` + `Number_of_Trips_50-100`, data=newdata_subset)
anova(deaths.model)
# Refit the ANOVA model as a linear model to obtain the adjusted R-squared
cases.model2 = lm(New_cases ~ `Number_of_Trips_250-500` + `Number_of_Trips_1-3` + `Number_of_Trips_50-100`, data=newdata_subset)
anova(cases.model2)
summary(cases.model2)
# Refit the ANOVA model as a linear model to obtain the adjusted R-squared
death.model2 = lm(New_deaths ~ `Number_of_Trips_250-500` + `Number_of_Trips_1-3` + `Number_of_Trips_50-100`, data=newdata_subset)
anova(death.model2)
summary(death.model2)
plot(cases.model)
plot(deaths.model)
sessionInfo()
WHO coronavirus disease (COVID-19) dashboard. Geneva: World Health Organization, 2020. Available online: https://covid19.who.int/
Trips by Distance. U.S. Department of Transportation Bureau of Transportation Statistics. Available online: https://data.bts.gov/Research-and-Statistics/Trips-by-Distance/w96p-f2qv
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.9.3 car_3.0-10 carData_3.0-4 forcats_0.5.0
## [5] stringr_1.4.0 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
## [9] ggplot2_3.3.3 tidyverse_1.3.0 tibble_3.0.4 dplyr_1.0.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 lattice_0.20-41 lubridate_1.7.9.2 assertthat_0.2.1
## [5] digest_0.6.27 R6_2.5.0 cellranger_1.1.0 backports_1.2.1
## [9] reprex_0.3.0 evaluate_0.14 httr_1.4.2 pillar_1.4.7
## [13] rlang_0.4.10 lazyeval_0.2.2 curl_4.3 readxl_1.3.1
## [17] rstudioapi_0.13 data.table_1.13.6 Matrix_1.2-18 rmarkdown_2.6
## [21] labeling_0.4.2 splines_4.0.3 foreign_0.8-80 htmlwidgets_1.5.3
## [25] munsell_0.5.0 broom_0.7.3 compiler_4.0.3 modelr_0.1.8
## [29] xfun_0.19 pkgconfig_2.0.3 mgcv_1.8-33 htmltools_0.5.0
## [33] tidyselect_1.1.0 rio_0.5.16 fansi_0.4.1 viridisLite_0.3.0
## [37] crayon_1.3.4 dbplyr_2.0.0 withr_2.3.0 grid_4.0.3
## [41] nlme_3.1-149 jsonlite_1.7.2 gtable_0.3.0 lifecycle_0.2.0
## [45] DBI_1.1.0 magrittr_2.0.1 scales_1.1.1 zip_2.1.1
## [49] cli_2.2.0 stringi_1.5.3 farver_2.0.3 fs_1.5.0
## [53] xml2_1.3.2 ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.6
## [57] openxlsx_4.2.3 tools_4.0.3 glue_1.4.2 crosstalk_1.1.1
## [61] hms_0.5.3 abind_1.4-5 yaml_2.2.1 colorspace_2.0-0
## [65] rvest_0.3.6 knitr_1.30 haven_2.3.1